Mini Project (Unsupervised Learning)


Principal Component Analysis of Wine Data


Steps involved

PCA involves the following broad-level steps:

1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Select k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
5. Construct a projection matrix W from the "top" k eigenvectors.
6. Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace.
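
For orientation, steps 2-6 can be sketched in a few lines of NumPy. This is an illustrative sketch only: X_std stands for the standardized data produced in step 1, and k is the chosen subspace dimensionality.

In [ ]:
import numpy as np

k = 7                                    # chosen subspace dimensionality (illustrative)

# Steps 2-3: covariance matrix and its eigendecomposition
# (assumes X_std is the standardized n x d data matrix from step 1)
cov = np.cov(X_std, rowvar=False)        # d x d covariance matrix
vals, vecs = np.linalg.eig(cov)          # eigenvectors are the COLUMNS of vecs

# Steps 4-5: sort eigenpairs by descending eigenvalue, keep the top k columns as W
order = np.argsort(vals)[::-1]
W = vecs[:, order[:k]]                   # d x k projection matrix

# Step 6: project the standardized data onto the new k-dimensional subspace
X_pca = X_std.dot(W)                     # n x k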

In [37]:
# Import the modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore

In [7]:
# Read the dataset
dataset = pd.read_csv("Datasets/wine.data", header=None)

In [8]:
# Descriptive analytics
print("Shape of the dataset: ", dataset.shape)


Shape of the dataset:  (178, 14)

In [10]:
# Displaying the top 5 rows of the dataset
dataset.head(5)


Out[10]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [17]:
# Check for null values
dataset.isnull().values.sum()


Out[17]:
0

Attribute definition

The first attribute is the class identifier (1-3). The remaining attributes are:

  1. Alcohol
  2. Malic acid
  3. Ash
  4. Alcalinity of ash
  5. Magnesium
  6. Total phenols
  7. Flavanoids
  8. Nonflavanoid phenols
  9. Proanthocyanins
  10. Color intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline

So we will consider 13 attributes for PCA.


In [18]:
# Exclude the first attribute (the class label)
X = dataset.iloc[:, 1:].values

Standardizing the 13-dimensional dataset

We will use the StandardScaler class from the sklearn.preprocessing module to standardize the dataset.


In [22]:
# Standardize the dataset
sc_X = StandardScaler()
X_std = sc_X.fit_transform(X)

In [25]:
# Display the first 3 rows of the standardized dataset
X_std[:3, :]


Out[25]:
array([[ 1.51861254, -0.5622498 ,  0.23205254, -1.16959318,  1.91390522,
         0.80899739,  1.03481896, -0.65956311,  1.22488398,  0.25171685,
         0.36217728,  1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, -2.49084714,  0.01814502,
         0.56864766,  0.73362894, -0.82071924, -0.54472099, -0.29332133,
         0.40605066,  1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, -0.2687382 ,  0.08835836,
         0.80899739,  1.21553297, -0.49840699,  2.13596773,  0.26901965,
         0.31830389,  0.78858745,  1.39514818]])
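
As a quick sanity check (added here, not part of the original flow), every standardized column should now have mean ≈ 0 and standard deviation ≈ 1:

In [ ]:
# Per-feature mean should be ~0 and standard deviation ~1 after scaling
print(np.allclose(X_std.mean(axis=0), 0))   # expected: True
print(np.allclose(X_std.std(axis=0), 1))    # expected: True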

Constructing the covariance matrix

We will use the cov function from the numpy module to compute the covariance matrix.


In [33]:
# Construct the covariance matrix (np.cov expects variables in rows, hence the transpose)
cov_matrix = np.cov(X_std.transpose())
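
A side observation (not in the original notebook): because the features are standardized, this covariance matrix is the correlation matrix of the original features, up to a factor of n/(n-1). The diagonal entries are therefore ≈ 1.0056 rather than exactly 1, since np.cov divides by n-1 while StandardScaler divides by n:

In [ ]:
# Equivalent call without the transpose: rowvar=False treats columns as variables
print(np.allclose(cov_matrix, np.cov(X_std, rowvar=False)))   # expected: True

# Diagonal entries equal n/(n-1) ~ 1.0056 due to the ddof mismatch
n = X_std.shape[0]
print(cov_matrix[0, 0], n / (n - 1))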

In [49]:
# Visualize the covariance matrix as a heatmap
plt.figure(figsize=(15, 15))
sns.heatmap(cov_matrix, annot=True, cmap="Greens")


Out[49]:
[Heatmap of the 13×13 covariance matrix]

In [51]:
# Pair plot for this dataset
sns.pairplot(pd.DataFrame(X_std))


Out[51]:
[Pair plot of the 13 standardized features]

Decomposing the covariance matrix into its eigenvectors and eigenvalues

To do this, we will use the linalg.eig function from the numpy module.


In [52]:
# Compute the eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
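
Since a covariance matrix is always symmetric, numpy.linalg.eigh is arguably the better-suited routine: it guarantees real eigenvalues and returns them in ascending order. This is an alternative sketch, not what the notebook uses:

In [ ]:
# eigh is specialized for symmetric matrices; reverse its ascending output
# to get the familiar largest-first ordering
eig_vals_h, eig_vecs_h = np.linalg.eigh(cov_matrix)
eig_vals_h = eig_vals_h[::-1]
eig_vecs_h = eig_vecs_h[:, ::-1]
print(np.allclose(eig_vals_h, np.sort(eig_vals)[::-1]))   # expected: True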

In [75]:
# Display the eigenvectors
print("Eigen Vectors:")
pd.DataFrame(eig_vecs)


Eigen Vectors:
Out[75]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 -0.144329 0.483652 -0.207383 0.017856 -0.265664 0.213539 0.056396 -0.014970 0.396139 -0.266286 -0.508619 -0.225917 0.211605
1 0.245188 0.224931 0.089013 -0.536890 0.035214 0.536814 -0.420524 -0.025964 0.065827 0.121696 0.075283 0.076486 -0.309080
2 0.002051 0.316069 0.626224 0.214176 -0.143025 0.154475 0.149171 0.141218 -0.170260 -0.049622 0.307694 -0.498691 -0.027125
3 0.239320 -0.010591 0.612080 -0.060859 0.066103 -0.100825 0.286969 -0.091683 0.427970 -0.055743 -0.200449 0.479314 0.052799
4 -0.141992 0.299634 0.130757 0.351797 0.727049 0.038144 -0.322883 -0.056774 -0.156361 0.062220 -0.271403 0.071289 0.067870
5 -0.394661 0.065040 0.146179 -0.198068 -0.149318 -0.084122 0.027925 0.463908 -0.405934 -0.303882 -0.286035 0.304341 -0.320131
6 -0.422934 -0.003360 0.150682 -0.152295 -0.109026 -0.018920 0.060685 -0.832257 -0.187245 -0.042899 -0.049578 -0.025694 -0.163151
7 0.298533 0.028779 0.170368 0.203301 -0.500703 -0.258594 -0.595447 -0.114040 -0.233285 0.042352 -0.195501 0.116896 0.215535
8 -0.313429 0.039302 0.149454 -0.399057 0.136860 -0.533795 -0.372139 0.116917 0.368227 -0.095553 0.209145 -0.237363 0.134184
9 0.088617 0.529996 -0.137306 -0.065926 -0.076437 -0.418644 0.227712 0.011993 -0.033797 0.604222 -0.056218 0.031839 -0.290775
10 -0.296715 -0.279235 0.085222 0.427771 -0.173615 0.105983 -0.232076 0.089889 0.436624 0.259214 -0.085828 -0.048212 -0.522399
11 -0.376167 -0.164496 0.166005 -0.184121 -0.101161 0.265851 0.044764 0.156718 -0.078108 0.600959 -0.137227 0.046423 0.523706
12 -0.286752 0.364903 -0.126746 0.232071 -0.157869 0.119726 -0.076805 -0.014447 0.120023 -0.079402 0.575786 0.539270 0.162116

In [74]:
# Display the eigenvalues
print("Eigen Values:")
pd.DataFrame(eig_vals).transpose()


Eigen Values:
Out[74]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 4.732437 2.511081 1.454242 0.924166 0.858049 0.645282 0.554141 0.103962 0.350466 0.169724 0.290512 0.227064 0.25232
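
To justify the choice of k, it helps to look at the proportion of total variance each eigenvalue explains (a sketch added here, derived from the eigenvalues above). The seven largest eigenvalues account for roughly 89% of the total variance:

In [ ]:
# Proportion of variance explained by each eigenvalue, largest first
sorted_vals = np.sort(eig_vals)[::-1]
explained = sorted_vals / sorted_vals.sum()
print(np.cumsum(explained)[6])   # expected: ~0.89 for the top 7 eigenvalues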

Selecting k eigenvectors

Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).

Let's select the first 7 eigenvectors, whose eigenvalues are all >= 0.5. Note that np.linalg.eig makes no guarantee about the order of the eigenpairs; for this dataset the first seven columns happen to correspond to the seven largest eigenvalues in descending order, but the remaining columns are unsorted (0.103962 appears before 0.350466).
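
Because of that ordering caveat, a safer route (a defensive sketch) is to sort the eigenpairs by descending eigenvalue before slicing:

In [ ]:
# Sort eigenpairs by descending eigenvalue; eigenvectors are the columns
order = np.argsort(eig_vals)[::-1]
eig_vals_sorted = eig_vals[order]
eig_vecs_sorted = eig_vecs[:, order]
# For this dataset the first 7 columns already happen to be the top 7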


In [76]:
# Select the first 7 eigenvectors: they are the COLUMNS of eig_vecs, so keep
# all 13 rows and slice the first 7 columns (13x7), not eig_vecs[:7, :7]
eig_vecs_selected = eig_vecs[:, :7]

In [78]:
# Display the first 7 eigenvectors
print("First 7 Eigen Vectors:")
pd.DataFrame(eig_vecs_selected)


First 7 Eigen Vectors:
Out[78]:
0 1 2 3 4 5 6
0 -0.144329 0.483652 -0.207383 0.017856 -0.265664 0.213539 0.056396
1 0.245188 0.224931 0.089013 -0.536890 0.035214 0.536814 -0.420524
2 0.002051 0.316069 0.626224 0.214176 -0.143025 0.154475 0.149171
3 0.239320 -0.010591 0.612080 -0.060859 0.066103 -0.100825 0.286969
4 -0.141992 0.299634 0.130757 0.351797 0.727049 0.038144 -0.322883
5 -0.394661 0.065040 0.146179 -0.198068 -0.149318 -0.084122 0.027925
6 -0.422934 -0.003360 0.150682 -0.152295 -0.109026 -0.018920 0.060685
7 0.298533 0.028779 0.170368 0.203301 -0.500703 -0.258594 -0.595447
8 -0.313429 0.039302 0.149454 -0.399057 0.136860 -0.533795 -0.372139
9 0.088617 0.529996 -0.137306 -0.065926 -0.076437 -0.418644 0.227712
10 -0.296715 -0.279235 0.085222 0.427771 -0.173615 0.105983 -0.232076
11 -0.376167 -0.164496 0.166005 -0.184121 -0.101161 0.265851 0.044764
12 -0.286752 0.364903 -0.126746 0.232071 -0.157869 0.119726 -0.076805
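
Steps 5 and 6 from the outline (building the projection matrix W and transforming the data) are not shown above; here is a minimal sketch that completes them, assuming the corrected 13x7 eig_vecs_selected. As a cross-check, sklearn's PCA class (imported at the top but unused so far) should produce the same projection up to the sign of each component:

In [ ]:
# Step 5: the projection matrix W is just the selected eigenvectors (13x7)
W = eig_vecs_selected

# Step 6: project the standardized data onto the 7-dimensional subspace
X_pca = X_std.dot(W)                      # shape: (178, 7)
print("Shape of the transformed data:", X_pca.shape)

# Cross-check with sklearn; components may differ in sign, so compare |values|
pca = PCA(n_components=7)
X_pca_sk = pca.fit_transform(X_std)
print(np.allclose(np.abs(X_pca), np.abs(X_pca_sk)))   # expected: True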